For the first analysis module, our goal is to predict a quantitative variable. Since the dataset is better suited for classification, only a single quantitative analysis will be performed.
To start the analysis, let's import the required dataset; given the models' needs, the aggregated (hourly) dataset will be used.
# Load dataset from cache
if (file.exists("Cache/traffic_hourly.rds") && file.exists("Cache/sf_gatesGPS.rds")) {
  sf_gatesGPS <- readRDS("Cache/sf_gatesGPS.rds")
  df_hourly <- readRDS("Cache/traffic_hourly.rds")
  # Change id_varco to its real location
  df_hourly <- df_hourly %>%
    left_join(sf_gatesGPS %>%
                st_drop_geometry() %>%
                dplyr::select(id_amat, label),
              by = c("id_varco" = "id_amat"))
  df_hourly$id_varco <- NULL
  df_hourly$label <- as.factor(df_hourly$label)
  message("Data loaded successfully.")
} else {
  stop("Run step 01 first.")
}
## Data loaded successfully.
Since we're starting to predict, we need data to verify whether the model also works on observations not seen during training. To do so, we split the dataset in two: 80% for the training phase, reserving 20% for the tests.
# Split Train/Test (80/20)
set.seed(123)
train_index <- sample(1:nrow(df_hourly), 0.8 * nrow(df_hourly))
train_data <- df_hourly[train_index, ]
test_data <- df_hourly[-train_index, ]
rm(df_hourly, train_index)
# Remove NA rows (there are just 23)
train_data <- na.omit(train_data)
test_data <- na.omit(test_data)
What leads to more traffic?
The question we'll try to answer is related to traffic volume prediction. We'll start from a basic Multiple Linear Regression, considering as predictors:
- hour, since it will probably be the most influential predictor
- day_of_week, since the EDA showed how traffic differs between weekdays and weekends
- month, to capture holidays and the seasonal effect
As outcome, we'll use total_transits.
This model will surely perform poorly, since we're treating the phenomenon as linear even though the EDA showed that traffic follows an "M" pattern. It will simply act as a baseline to measure the improvements. Based on the chosen predictors, we'll compute: \[TotalTransits = \beta_0 + \beta_1(Hour) + \beta_2(DayOfWeek) + \beta_3(Month) + \epsilon\] [!NOTE] To avoid the Intercept representing just the reference levels (Monday in January), we'll switch to sum coding, where the Intercept represents the global traffic mean. This allows us to interpret each predictor's coefficient as the difference between its level and the mean value. From a prediction point of view this modification changes nothing, but from an inference point of view it helps us answer the research question better.
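As a quick sanity check of the sum-coding claim, here is a minimal, self-contained sketch (toy data and names, not the traffic dataset) showing that under contr.sum the intercept equals the mean of the group means:

```r
# Toy example: with sum coding, the intercept is the grand mean of the
# group means rather than the first factor level
f <- factor(rep(c("a", "b", "c"), each = 4))
y <- c(11, 9, 10, 10, 21, 19, 20, 20, 31, 29, 30, 30)  # group means 10/20/30
m <- lm(y ~ f, contrasts = list(f = "contr.sum"))
unname(coef(m)[1])        # 20: the mean of the three group means
mean(tapply(y, f, mean))  # 20: identical by construction
```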
# Change the contrast settings
options(contrasts = c("contr.sum", "contr.poly"))
# Model
linear_model <- lm(total_transits ~ hour + day_of_week + month, data = train_data)
summary(linear_model)
##
## Call:
## lm(formula = total_transits ~ hour + day_of_week + month, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -142.70 -79.90 -38.18 40.30 816.96
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.28634 1.67256 44.415 < 2e-16 ***
## hour 2.03760 0.03254 62.615 < 2e-16 ***
## day_of_week.L -11.58074 0.59806 -19.364 < 2e-16 ***
## day_of_week.Q -21.52406 0.59783 -36.003 < 2e-16 ***
## day_of_week.C -5.32076 0.59621 -8.924 < 2e-16 ***
## day_of_week^4 -2.76670 0.59602 -4.642 3.45e-06 ***
## day_of_week^5 0.47916 0.59471 0.806 0.420415
## day_of_week^6 0.03164 0.59582 0.053 0.957646
## month.L -33.24295 8.90958 -3.731 0.000191 ***
## month.Q -20.40645 9.71535 -2.100 0.035692 *
## month.C -15.33353 8.90941 -1.721 0.085243 .
## month^4 -44.87842 7.15729 -6.270 3.61e-10 ***
## month^5 -36.20081 5.10555 -7.090 1.34e-12 ***
## month^6 -10.02906 3.25804 -3.078 0.002082 **
## month^7 19.47489 1.89976 10.251 < 2e-16 ***
## month^8 6.70016 1.11666 6.000 1.97e-09 ***
## month^9 -13.27106 0.81866 -16.211 < 2e-16 ***
## month^10 -22.06306 0.75342 -29.284 < 2e-16 ***
## month^11 -18.63772 0.74640 -24.970 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114.2 on 257222 degrees of freedom
## Multiple R-squared: 0.04044, Adjusted R-squared: 0.04037
## F-statistic: 602.2 on 18 and 257222 DF, p-value: < 2.2e-16
Unfortunately, we can't see, for example, the comparison of every day with the Intercept, since the last level (the n-1 constraint) is implicit: by default, R "hides" the last item. To be sure, we can print the levels and identify the last one (Sunday and December will be the ghosts).
levels(train_data$day_of_week)
## [1] "Mon" "Tue" "Wed" "Thu" "Fri" "Sat" "Sun"
levels(train_data$month)
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
As expected, the model performed really poorly:
- Looking at the Adjusted R-Squared, the multiple linear regression explains only 4% of the traffic variability; the remaining 96% is treated as noise or lost information.
- The Residual Standard Error confirms that the model cannot predict traffic correctly, with an error of 114 vehicles against the 74 vehicles represented by the Intercept. It's basically useless.
- However, (almost) all the p-values are extremely low; time is surely influencing the traffic.
- The residual distribution (median -38, maximum 817) shows that the model overshoots typical quiet hours while badly underestimating the peaks. To be fair, we were aware of the non-linearity: the EDA clearly showed an "M" shape that such a simple model cannot capture.
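For reference, the figures quoted above can be pulled straight out of the fitted object; a small self-contained sketch on R's built-in cars data (the same accessors apply to linear_model):

```r
# Where summary.lm stores the quoted diagnostics (cars is a stand-in dataset)
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)
s$adj.r.squared           # Adjusted R-squared
s$sigma                   # Residual Standard Error
quantile(residuals(fit))  # min / 1Q / median / 3Q / max of the residuals
```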
We really need to account for the non-linear nature of the data. Here, we'll allow the line to bend by using Polynomial Regression:
\[TotalTransits = \beta_0 + \beta_1(Hour) + \beta_2(Hour^2) + \beta_3(Hour^3) + \beta_4(Hour^4) + \beta_5(DayOfWeek) + \beta_6(Month) + \epsilon\] The quadratic term allows a parabolic line (one rise and one fall), while the cubic and quartic terms allow for additional bends (one in the morning, one in the afternoon), matching the poly(hour, 4) used in the model.
# Change the contrast settings
options(contrasts = c("contr.sum", "contr.poly"))
poly_model <- lm(total_transits ~ poly(hour, 4) + day_of_week + month, data = train_data)
summary(poly_model)
##
## Call:
## lm(formula = total_transits ~ poly(hour, 4) + day_of_week + month,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170.79 -67.65 -22.60 37.69 789.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.043e+02 1.497e+00 69.635 < 2e-16 ***
## poly(hour, 4)1 7.144e+03 1.053e+02 67.835 < 2e-16 ***
## poly(hour, 4)2 -2.240e+04 1.053e+02 -212.636 < 2e-16 ***
## poly(hour, 4)3 1.140e+03 1.053e+02 10.825 < 2e-16 ***
## poly(hour, 4)4 1.058e+03 1.053e+02 10.041 < 2e-16 ***
## day_of_week.L -1.183e+01 5.514e-01 -21.455 < 2e-16 ***
## day_of_week.Q -2.161e+01 5.511e-01 -39.210 < 2e-16 ***
## day_of_week.C -5.252e+00 5.496e-01 -9.554 < 2e-16 ***
## day_of_week^4 -2.823e+00 5.495e-01 -5.138 2.77e-07 ***
## day_of_week^5 3.181e-01 5.483e-01 0.580 0.561752
## day_of_week^6 -2.713e-02 5.493e-01 -0.049 0.960610
## month.L 2.861e+00 8.220e+00 0.348 0.727777
## month.Q 1.890e+01 8.964e+00 2.108 0.034996 *
## month.C 2.091e+01 8.220e+00 2.544 0.010961 *
## month^4 -1.576e+01 6.604e+00 -2.387 0.016984 *
## month^5 -1.575e+01 4.711e+00 -3.345 0.000824 ***
## month^6 2.766e+00 3.006e+00 0.920 0.357401
## month^7 2.636e+01 1.753e+00 15.043 < 2e-16 ***
## month^8 1.016e+01 1.030e+00 9.864 < 2e-16 ***
## month^9 -1.207e+01 7.548e-01 -15.989 < 2e-16 ***
## month^10 -2.155e+01 6.946e-01 -31.030 < 2e-16 ***
## month^11 -1.841e+01 6.881e-01 -26.758 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 105.3 on 257219 degrees of freedom
## Multiple R-squared: 0.1845, Adjusted R-squared: 0.1844
## F-statistic: 2771 on 21 and 257219 DF, p-value: < 2.2e-16
Introducing the flexibility offered by the polynomial model, we can see a small increase in the Adjusted R-Squared, now at 18%. It's not good, but it confirms that traffic isn't a cumulative phenomenon but a cyclic one.
The poly(hour, 4) coefficients offer a more realistic scenario: poly(hour, 4)2 has the largest magnitude (-2.240e+04), indicating an inverted "U", probably representing the morning traffic ramp compared to the quiet night. The errors captured by the Residual Standard Error are still high: 105 is better than the previous 114, but not enough to be considered good. This clearly says that using a single city-wide mean for Milan is not a good idea. Maybe that's because we're averaging over all the gates?
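Whether the polynomial terms are worth their extra degrees of freedom can also be checked formally with a nested F-test; a self-contained sketch on the built-in cars data (the same anova() call would compare linear_model and poly_model from the text):

```r
# Nested model comparison: does the added curvature significantly reduce RSS?
base <- lm(dist ~ speed, data = cars)
flex <- lm(dist ~ poly(speed, 2), data = cars)
anova(base, flex)  # a small p-value would justify keeping the extra term
```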
With the third approach, we'll combine two improvements:
- an interaction between the hour polynomial and day_of_week, so the daily curve can change shape across the week
- the gate label as a predictor, so every gate contributes its own baseline level
# Change the contrast settings
options(contrasts = c("contr.sum", "contr.poly"))
# Model
interact_model <- lm(total_transits ~ poly(hour, 4) * day_of_week + month + label, data = train_data)
summary(interact_model)
##
## Call:
## lm(formula = total_transits ~ poly(hour, 4) * day_of_week + month +
## label, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -315.41 -35.89 -4.78 30.44 638.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.048e+02 8.816e-01 118.925 < 2e-16 ***
## poly(hour, 4)1 7.193e+03 6.164e+01 116.694 < 2e-16 ***
## poly(hour, 4)2 -2.241e+04 6.164e+01 -363.599 < 2e-16 ***
## poly(hour, 4)3 1.091e+03 6.164e+01 17.696 < 2e-16 ***
## poly(hour, 4)4 1.033e+03 6.164e+01 16.759 < 2e-16 ***
## day_of_week.L -1.211e+01 3.227e-01 -37.522 < 2e-16 ***
## day_of_week.Q -2.151e+01 3.226e-01 -66.693 < 2e-16 ***
## day_of_week.C -5.208e+00 3.217e-01 -16.190 < 2e-16 ***
## day_of_week^4 -2.830e+00 3.216e-01 -8.801 < 2e-16 ***
## day_of_week^5 4.547e-01 3.209e-01 1.417 0.156469
## day_of_week^6 5.090e-02 3.215e-01 0.158 0.874191
## month.L 5.401e+00 4.840e+00 1.116 0.264418
## month.Q 2.124e+01 5.278e+00 4.024 5.72e-05 ***
## month.C 2.367e+01 4.840e+00 4.891 1.00e-06 ***
## month^4 -1.369e+01 3.888e+00 -3.520 0.000431 ***
## month^5 -1.404e+01 2.773e+00 -5.064 4.10e-07 ***
## month^6 4.205e+00 1.769e+00 2.376 0.017479 *
## month^7 2.687e+01 1.031e+00 26.064 < 2e-16 ***
## month^8 1.030e+01 6.048e-01 17.033 < 2e-16 ***
## month^9 -1.212e+01 4.422e-01 -27.403 < 2e-16 ***
## month^10 -2.169e+01 4.066e-01 -53.359 < 2e-16 ***
## month^11 -1.830e+01 4.027e-01 -45.447 < 2e-16 ***
## label1 8.330e+01 7.576e-01 109.957 < 2e-16 ***
## label2 1.695e+01 7.591e-01 22.323 < 2e-16 ***
## label3 -8.056e+01 7.562e-01 -106.540 < 2e-16 ***
## label4 -9.341e+01 7.587e-01 -123.126 < 2e-16 ***
## label5 -1.831e+01 7.553e-01 -24.245 < 2e-16 ***
## label6 7.280e+01 7.611e-01 95.641 < 2e-16 ***
## label7 -2.474e+01 7.602e-01 -32.547 < 2e-16 ***
## label8 1.169e+02 7.599e-01 153.790 < 2e-16 ***
## label9 -2.880e+01 7.594e-01 -37.926 < 2e-16 ***
## label10 -1.082e+01 7.608e-01 -14.222 < 2e-16 ***
## label11 -3.857e+01 7.554e-01 -51.054 < 2e-16 ***
## label12 -8.488e+01 7.599e-01 -111.698 < 2e-16 ***
## label13 1.154e+02 7.606e-01 151.675 < 2e-16 ***
## label14 -5.750e+01 7.598e-01 -75.676 < 2e-16 ***
## label15 -8.705e+01 7.599e-01 -114.555 < 2e-16 ***
## label16 7.932e+01 7.605e-01 104.287 < 2e-16 ***
## label17 -3.486e+01 7.599e-01 -45.881 < 2e-16 ***
## label18 -4.872e+01 7.579e-01 -64.284 < 2e-16 ***
## label19 1.283e+02 7.629e-01 168.168 < 2e-16 ***
## label20 -9.094e+01 7.619e-01 -119.354 < 2e-16 ***
## label21 -9.441e+01 7.556e-01 -124.947 < 2e-16 ***
## label22 1.064e+02 7.576e-01 140.397 < 2e-16 ***
## label23 -5.102e+00 7.613e-01 -6.702 2.06e-11 ***
## label24 9.384e+00 7.561e-01 12.411 < 2e-16 ***
## label25 -6.812e+01 7.600e-01 -89.623 < 2e-16 ***
## label26 -3.899e+01 7.577e-01 -51.463 < 2e-16 ***
## label27 4.093e+01 7.574e-01 54.032 < 2e-16 ***
## label28 5.119e+01 7.588e-01 67.463 < 2e-16 ***
## label29 -8.052e+01 7.608e-01 -105.836 < 2e-16 ***
## label30 -8.166e+01 7.560e-01 -108.013 < 2e-16 ***
## label31 1.430e+02 7.631e-01 187.454 < 2e-16 ***
## label32 -7.922e+01 7.569e-01 -104.664 < 2e-16 ***
## label33 -9.332e+01 7.598e-01 -122.824 < 2e-16 ***
## label34 -6.407e+00 7.570e-01 -8.463 < 2e-16 ***
## label35 -6.971e+01 7.575e-01 -92.031 < 2e-16 ***
## label36 2.473e+02 7.554e-01 327.393 < 2e-16 ***
## label37 1.641e+02 7.592e-01 216.080 < 2e-16 ***
## label38 -2.401e+01 7.596e-01 -31.609 < 2e-16 ***
## label39 7.214e-01 7.556e-01 0.955 0.339748
## poly(hour, 4)1:day_of_week.L 6.811e+03 1.634e+02 41.693 < 2e-16 ***
## poly(hour, 4)2:day_of_week.L 7.575e+03 1.633e+02 46.383 < 2e-16 ***
## poly(hour, 4)3:day_of_week.L -9.423e+03 1.633e+02 -57.690 < 2e-16 ***
## poly(hour, 4)4:day_of_week.L 3.626e+03 1.633e+02 22.197 < 2e-16 ***
## poly(hour, 4)1:day_of_week.Q -2.113e+02 1.634e+02 -1.293 0.196019
## poly(hour, 4)2:day_of_week.Q 3.190e+03 1.635e+02 19.517 < 2e-16 ***
## poly(hour, 4)3:day_of_week.Q -6.861e+03 1.634e+02 -41.979 < 2e-16 ***
## poly(hour, 4)4:day_of_week.Q 1.978e+03 1.634e+02 12.100 < 2e-16 ***
## poly(hour, 4)1:day_of_week.C -1.229e+03 1.630e+02 -7.538 4.80e-14 ***
## poly(hour, 4)2:day_of_week.C -2.071e+03 1.630e+02 -12.704 < 2e-16 ***
## poly(hour, 4)3:day_of_week.C -2.842e+03 1.631e+02 -17.426 < 2e-16 ***
## poly(hour, 4)4:day_of_week.C -9.318e+00 1.630e+02 -0.057 0.954418
## poly(hour, 4)1:day_of_week^4 -2.008e+03 1.630e+02 -12.318 < 2e-16 ***
## poly(hour, 4)2:day_of_week^4 -1.051e+03 1.629e+02 -6.453 1.10e-10 ***
## poly(hour, 4)3:day_of_week^4 1.773e+01 1.629e+02 0.109 0.913321
## poly(hour, 4)4:day_of_week^4 -1.253e+03 1.630e+02 -7.685 1.54e-14 ***
## poly(hour, 4)1:day_of_week^5 -1.545e+03 1.627e+02 -9.498 < 2e-16 ***
## poly(hour, 4)2:day_of_week^5 -2.220e+02 1.628e+02 -1.364 0.172682
## poly(hour, 4)3:day_of_week^5 1.910e+03 1.628e+02 11.734 < 2e-16 ***
## poly(hour, 4)4:day_of_week^5 -1.583e+03 1.627e+02 -9.728 < 2e-16 ***
## poly(hour, 4)1:day_of_week^6 -6.796e+01 1.630e+02 -0.417 0.676675
## poly(hour, 4)2:day_of_week^6 2.875e+02 1.631e+02 1.763 0.077923 .
## poly(hour, 4)3:day_of_week^6 9.015e+02 1.630e+02 5.530 3.20e-08 ***
## poly(hour, 4)4:day_of_week^6 -6.211e+02 1.630e+02 -3.810 0.000139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.62 on 257156 degrees of freedom
## Multiple R-squared: 0.7208, Adjusted R-squared: 0.7207
## F-statistic: 7902 on 84 and 257156 DF, p-value: < 2.2e-16
This approach outperforms all the previous ones. The synergy between time and space is the key: the Residual Standard Error has now dropped to 61; not a perfect result, but a significant improvement. The gate effect is strong: label36 alone has an estimate of 247.3. Introducing the interaction has brought interesting results, visible in the poly(hour, 4)1:day_of_week terms: we can now say that the shape of the day changes across the week and, looking at the p-values, that the peaks are neither standard nor fixed.
Overall, the Adjusted R-Squared of 72% is a great result, but the residuals still show a maximum error of 638: Milan's traffic is almost never the same.
For a quick moment, we'll step away from linear models in favor of a Generalized Additive Model. GAMs offer very flexible splines that may represent the solution to this question. As a bonus analysis, we'll model the total transits with a spline while keeping the interaction, creating a different curve for every day of the week.
Please note that instead of the default Gaussian family we'll use the Negative Binomial one. This choice was made because the Gaussian family "suffers" in the presence of very large outliers. Traffic data are counts and often overdispersed, whereas the Gaussian assumes constant variance. In real traffic data the variance tends to increase with the traffic level: when there is little traffic, the variance is low; when there is a lot of traffic, the variance explodes. The Negative Binomial should handle all the spikes that the Gaussian tends to ignore or treat as errors.
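The mean-variance argument can be illustrated with a short simulation (purely synthetic counts, not the traffic data): for a Negative Binomial with dispersion theta, Var(Y) = mu + mu^2/theta, so the variance grows much faster than the mean:

```r
# Simulated NB counts: the variance explodes as the mean grows,
# unlike the constant variance a Gaussian model assumes
set.seed(42)
theta <- 5
for (mu in c(10, 100, 500)) {
  y <- rnbinom(1e5, mu = mu, size = theta)
  cat(sprintf("mu = %3d  empirical var = %9.0f  theory = %9.0f\n",
              mu, var(y), mu + mu^2 / theta))
}
```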
gam_model <- gam(total_transits ~ s(hour, by = day_of_week, k = 20) +
                   day_of_week + month + label,
                 data = train_data,
                 family = nb(),
                 method = "REML")
summary(gam_model)
##
## Family: Negative Binomial(5.342)
## Link function: log
##
## Formula:
## total_transits ~ s(hour, by = day_of_week, k = 20) + day_of_week +
## month + label
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.026840 0.007017 573.851 < 2e-16 ***
## day_of_week.L -0.209191 0.002453 -85.286 < 2e-16 ***
## day_of_week.Q -0.085270 0.002453 -34.769 < 2e-16 ***
## day_of_week.C -0.216520 0.002453 -88.285 < 2e-16 ***
## day_of_week^4 0.016604 0.002457 6.758 1.40e-11 ***
## day_of_week^5 -0.029881 0.002453 -12.180 < 2e-16 ***
## day_of_week^6 0.009069 0.002458 3.690 0.000224 ***
## month.L 0.055273 0.038540 1.434 0.151523
## month.Q 0.278171 0.042030 6.618 3.63e-11 ***
## month.C 0.292335 0.038540 7.585 3.32e-14 ***
## month^4 -0.135920 0.030954 -4.391 1.13e-05 ***
## month^5 -0.149363 0.022068 -6.768 1.30e-11 ***
## month^6 0.037129 0.014055 2.642 0.008250 **
## month^7 0.300040 0.008152 36.807 < 2e-16 ***
## month^8 0.140069 0.004712 29.725 < 2e-16 ***
## month^9 -0.129219 0.003369 -38.359 < 2e-16 ***
## month^10 -0.279213 0.003099 -90.104 < 2e-16 ***
## month^11 -0.220104 0.003064 -71.835 < 2e-16 ***
## label1 0.996342 0.005459 182.499 < 2e-16 ***
## label2 0.512138 0.005539 92.452 < 2e-16 ***
## label3 -1.117396 0.006126 -182.408 < 2e-16 ***
## label4 -1.914765 0.006866 -278.894 < 2e-16 ***
## label5 0.279371 0.005559 50.258 < 2e-16 ***
## label6 0.868142 0.005500 157.838 < 2e-16 ***
## label7 0.087067 0.005637 15.445 < 2e-16 ***
## label8 1.194323 0.005454 218.973 < 2e-16 ***
## label9 0.042157 0.005646 7.466 8.25e-14 ***
## label10 0.313198 0.005591 56.017 < 2e-16 ***
## label11 -0.091554 0.005653 -16.195 < 2e-16 ***
## label12 -1.271815 0.006262 -203.093 < 2e-16 ***
## label13 1.229246 0.005455 225.363 < 2e-16 ***
## label14 -0.320342 0.005757 -55.641 < 2e-16 ***
## label15 -1.390716 0.006349 -219.056 < 2e-16 ***
## label16 0.997340 0.005480 182.005 < 2e-16 ***
## label17 0.104445 0.005629 18.553 < 2e-16 ***
## label18 -0.315724 0.005740 -55.004 < 2e-16 ***
## label19 1.191680 0.005475 217.656 < 2e-16 ***
## label20 -1.664789 0.006614 -251.715 < 2e-16 ***
## label21 -2.010901 0.006959 -288.948 < 2e-16 ***
## label22 1.173818 0.005438 215.846 < 2e-16 ***
## label23 0.308652 0.005595 55.168 < 2e-16 ***
## label24 0.443842 0.005533 80.217 < 2e-16 ***
## label25 -0.575083 0.005864 -98.076 < 2e-16 ***
## label26 -0.102328 0.005672 -18.040 < 2e-16 ***
## label27 0.750582 0.005488 136.767 < 2e-16 ***
## label28 0.782226 0.005494 142.369 < 2e-16 ***
## label29 -0.971712 0.006067 -160.162 < 2e-16 ***
## label30 -1.096469 0.006108 -179.505 < 2e-16 ***
## label31 1.310715 0.005466 239.779 < 2e-16 ***
## label32 -1.062379 0.006089 -174.475 < 2e-16 ***
## label33 -1.886142 0.006836 -275.920 < 2e-16 ***
## label34 0.274704 0.005572 49.297 < 2e-16 ***
## label35 -0.725008 0.005908 -122.716 < 2e-16 ***
## label36 1.649808 0.005385 306.388 < 2e-16 ***
## label37 1.391813 0.005431 256.265 < 2e-16 ***
## label38 0.054974 0.005644 9.741 < 2e-16 ***
## label39 0.483777 0.005519 87.662 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(hour):day_of_weekTue 18.79 18.99 106720 <2e-16 ***
## s(hour):day_of_weekWed 18.81 18.99 104138 <2e-16 ***
## s(hour):day_of_weekThu 18.78 18.99 91740 <2e-16 ***
## s(hour):day_of_weekFri 18.74 18.99 77058 <2e-16 ***
## s(hour):day_of_weekSat 18.62 18.98 58742 <2e-16 ***
## s(hour):day_of_weekSun 18.72 18.99 63191 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.843 Deviance explained = 85.2%
## -REML = 1.1876e+06 Scale est. = 1 n = 257241
Using a spline like s(hour) has once again increased the quality of the model, bringing the Adjusted R-Squared up to 84%. The deviance explained also confirms that the GAM approach is the best one, given the model's ability to explain almost all of the phenomenon, probably also thanks to the generous value k = 20, which grants each daily curve up to 19 effective degrees of freedom (the estimated edf sit just below that).
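Whether k = 20 is actually generous enough can be verified with mgcv's own basis-dimension diagnostics; a self-contained sketch on mgcv's simulated data (on the real fit, the corresponding call would be gam.check(gam_model) or k.check(gam_model)):

```r
# Basis-dimension check: a k-index near 1 and edf comfortably below k'
# suggest the chosen k is not constraining the smooth
library(mgcv)
set.seed(1)
dat <- gamSim(1, n = 400, verbose = FALSE)  # mgcv's built-in data simulator
b <- gam(y ~ s(x2, k = 20), data = dat, method = "REML")
k.check(b)
```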
So far, we evaluated the metrics on the training dataset. To verify the quality of the models, it is necessary to test them on the dedicated test dataset. To keep it simple, we'll perform the checks only on the strongest models: the Interaction and GAM ones.
A "real_profile" will be added to visualize how the model predictions differ from the hourly averages observed in the dataset.
# Real profile
df_hourly <- readRDS("Cache/traffic_hourly.rds")
real_profile <- df_hourly %>%
  group_by(hour) %>%
  summarise(real_mean = mean(total_transits, na.rm = TRUE), .groups = "drop")
rm(df_hourly)
# Prediction grid
pred_grid <- test_data %>%
  select(hour, day_of_week, month, label) %>%
  distinct()
pred_grid$pred_interact <- predict(interact_model, newdata = pred_grid)
pred_grid$pred_gam <- predict(gam_model, newdata = pred_grid, type = "response")
# Prediction aggregation by hour
model_profiles <- pred_grid %>%
  group_by(hour) %>%
  summarise(IterativePoly = mean(pred_interact, na.rm = TRUE),
            GAM = mean(pred_gam, na.rm = TRUE),
            .groups = "drop")
# Plot
plot_data <- model_profiles %>%
  inner_join(real_profile, by = "hour") %>%
  rename(Mean2024 = real_mean) %>%
  pivot_longer(cols = -hour, names_to = "Source", values_to = "Transits")
p_final <- ggplot(plot_data, aes(x = hour, y = Transits, color = Source, linetype = Source)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = c(7.5, 19.5), linetype = "dashed", color = "#525252", alpha = 0.8) +
  scale_color_manual(values = c("Mean2024" = "#252525",
                                "IterativePoly" = "#d7191c",
                                "GAM" = "#2c7bb6")) +
  scale_linetype_manual(values = c("Mean2024" = "dotted",
                                   "IterativePoly" = "solid",
                                   "GAM" = "solid")) +
  scale_x_continuous(breaks = 0:23) +
  labs(title = "Model Comparison",
       x = "Hour",
       y = "Hourly mean transits (per gate)",
       color = "Legend",
       linetype = "Legend") +
  theme_minimal()
ggplotly(p_final)
The GAM model with the Negative Binomial family has definitely won, even if it tends to overestimate at night and underestimate during the day. The Iterative Polynomial Regression doesn't bend enough to reproduce the "M" shape of the traffic we've already seen; adding more interactions might improve the results, but we'll stick with GAMs.
Just to be sure from an empirical point of view as well, let's compute the Root Mean Squared Error to observe the residuals' standard deviation on the test set.
# Iterative Polynomial
preds_interact <- predict(interact_model, newdata = test_data)
rmse_interact <- sqrt(mean((test_data$total_transits - preds_interact)^2))
# GAM
preds_gam <- predict(gam_model, newdata = test_data, type = "response")
rmse_gam <- sqrt(mean((test_data$total_transits - preds_gam)^2))
# Print
cat("RMSE Iterative Polynomial:", round(rmse_interact, 2), "\n")
## RMSE Iterative Polynomial: 61.68
cat("RMSE GAM:", round(rmse_gam, 2), "\n")
## RMSE GAM: 46.17
We have the confirmation: the GAM model performs better, being wrong by about 46 vehicles on average against the 62 of the polynomial model. It's still not perfect but, considering traffic's non-linear behavior, it's a good result.
The increase in traffic within Milan’s Area C appears to be attributable not to a single factor, but rather to a combination of social habits, environmental policies, and external variables.
The main driver of congestion is undoubtedly the urban work routine: data show traffic peaks around the start and end times of work, while volume drops dramatically on weekends. This trend is also confirmed on a seasonal scale, with sharp reductions in traffic during holiday periods such as August and December.
Regarding the composition of the vehicle fleet, the analysis highlights how the spread of hybrid vehicles is driving the maintenance of high volumes. While regulations discourage the most polluting vehicles, facilitated or free access for hybrid cars appears to offset this decline, preventing an actual reduction in the total number of vehicles on the road.
Commercial traffic maintains a constant and less elastic presence throughout the day, being less affected by time variations than private cars.